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ABSTRACT 

Motivation: Discovering the transcriptional regulatory architecture of 
the metabolism has been an important topic to understand the impli- 
cations of transcriptional fluctuations on metabolism. The reporter 
algorithm (RA) was proposed to determine the hot spots in metabolic 
networks, around which transcriptional regulation is focused owing to 
a disease or a genetic perturbation. Using a z-score-based scoring 
scheme, RA calculates the average statistical change in the expres- 
sion levels of genes that are neighbors to a target metabolite in the 
metabolic network. The RA approach has been used in numerous 
studies to analyze cellular responses to the downstream genetic 
changes. In this article, we propose a mutual information-based multi- 
variate reporter algorithm (MIRA) with the goal of eliminating the fol- 
lowing problems in detecting reporter metabolites: (i) conventional 
statistical methods suffer from small sample sizes, (ii) as z-score 
ranges from minus to plus infinity, calculating average scores can 
lead to canceling out opposite effects and (iii) analyzing genes one 
by one, then aggregating results can lead to information loss. MIRA is 
a multivariate and combinatorial algorithm that calculates the aggre- 
gate transcriptional response around a metabolite using mutual infor- 
mation. We show that MIRA's results are biologically sound, 
empirically significant and more reliable than RA. 
Results: We apply MIRA to gene expression analysis of six knockout 
strains of Escherichia coli and show that MIRA captures the underlying 
metabolic dynamics of the switch from aerobic to anaerobic respir- 
ation. We also apply MIRA to an Autism Spectrum Disorder gene ex- 
pression dataset. Results indicate that MIRA reports metabolites that 
highly overlap with recently found metabolic biomarkers in the autism 
literature. Overall, MIRA is a promising algorithm for detecting meta- 
bolic drug targets and understanding the relation between gene 
expression and metabolic activity. 

Availability and implementation: The code is implemented in C# 
language using .NET framework. Project is available upon request. 

Contact: cicek@cs.cmu.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online 

1 INTRODUCTION 

Changes with respect to environmental or genetic modifications 
lead to complex cellular responses. Standing on the top of the 
omics hierarchy, metabolomics reflects changes taking place in 
the transcriptome and in the genome. These responses are ana- 
lyzed by researchers to discover regulatory mechanisms and dy- 
namics of cells. Transcriptional responses of the cells, along with 
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the corresponding metabolic alterations, have been investigated 
in various contexts. Some examples are plant research (Brosche 
et al, 2005; Carari et al, 2006), diabetes (Ferrara et al, 2008; 
Zelezniak et al, 2010), insulin resistance (Jans et al, 2011) and 
cancer (Schramm et al, 2010). Functional class-based (Ger stein 
and Jansen, 2000; Hughes et al, 2000; Karp et al, 2002; Pavlidis 
et al, 2002; Seshasayee et al, 2009) and protein-protein inter- 
action network-based analysis of gene expression data 
(Chowdhury et al, 2010; Chuang et al, 2007; Goh et al, 2007; 
Ideker et al, 2002; Rhodes and Chinnaiyan, 2005) have been well 
established. Metabolic network- and metabolic pathway-driven 
analysis of transcriptional data have been receiving attention 
lately. Efforts have centered on discovering transcriptional regu- 
lation architecture of metabolic networks of organisms using 
genome -wide association studies (David et al, 2006; Ihmels 
et al, 2004; Kharchenko et al, 2005; Tanay et al, 2004). 
Various methods with different goals have been developed to 
use transcriptomic and metabolic data together in the context 
of a metabolic network such as (i) mining for new metabolite- 
gene/transcription factor relationships (Ideker et al, 2001; Yeang 
et al, 2006), (ii) flux balance analysis and constraint-based mod- 
eling of organisms (Covert and Palsson, 2002a, b; Shlomi et al, 
2008) and (iii) using metabolic network topology to identify sig- 
nificant changes in related groups of genes (Cakir et al, 2006; 
Deo et al, 2010; Dinu et al, 2007; Hancock et al, 2012; Nam 
et al, 2009; Oliveira et al, 2008; Patil and Nielsen, 2005; 
Subramanian et al, 2005; Ulitsky and Shamir, 2009). 

Network topology-based analysis of biological data is a broad 
research area (Ma'ayan, 2008). In the context of metabolic net- 
works and transcriptomics, the literature so far can be divided 
into two subcategories. The first type of analysis uses predefined 
metabolic pathways as targets for transcriptional regulation and 
analyzes the changes in the pathways. Gene Set Enrichment 
Analysis (GSEA) is the first and most established analysis in 
this subcategory (Subramanian et al, 2005). Improvements 
have been proposed in the literature to eliminate the shortcom- 
ings of the GSEA approach (Dinu et al, 2007; Draghici et al, 
2007; Hancock et al, 2012). The second type of analysis con- 
siders the metabolic network as a whole, and aims to find signa- 
tures or hot spots in the metabolic network that are subject to 
transcriptional regulation (Cakir et al, 2006; Nam et al, 2009; 
Patil and Nielsen, 2005; Schramm et al, 2010). The most estab- 
lished method in this group is the reporter algorithm (RA; Patil 
and Nielsen, 2005). Using the z-score-based method introduced 
before (Ideker et al, 2002), the algorithm aims to find metabol- 
ites around which transcriptional regulation is centered and link 
the complex transcriptional motives to metabolome. Various 
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extensions and modifications to the algorithm have been pub- 
lished in the literature. For instance, same idea has been used to 
discover reporter reactions (Cakir et ai, 2006). A similar z-score- 
based approach has also been used analyze the rate-limiting steps 
of pathways (Nam et al., 2009). In this article, we focus on the 
original RA (Patil and Nielsen, 2005) and its scoring mechanism 
(Ideker et al, 2002). Our observations, discussed next, also apply 
to the extensions of the RA method. 

RA first maps the differential data onto the enzymes (reac- 
tions) in the metabolic network, and then calculates the P- values 
for each gene, using student's Mest. P-values are then converted 
to z-scores using the inverse normal cumulative distribution 
(0 -7 ). Equation (1) shows how each p f is converted to the cor- 
responding z- score z t . Given all samples, z-score measures how 
many standard deviations away a P-value is. It ranges from 
negative to positive infinity, where negative infinity corresponds 
to no significance at all. 



Zi =o-\\- Pi ) 



(1) 



The z-score z m for a metabolite m is the aggregation of z-scores 
of the k enzymes that are neighbors of m in the metabolic net- 
work (through metabolic reactions), and calculated as shown in 
Equation (2). The null hypothesis is that the genes adjacent to a 
metabolite display their normalized average response by chance. 
Should the significance of a metabolite need to be determined, 
z-score is normalized and converted back into the corresponding 
P- value. 



1 
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Although RA has been shown to be effective in several different 
contexts such as analysis of Type 2 Diabetes (Zelezniak et al, 
2010), genome scale analysis of organisms (David et al, 2008; 
Usaite et al, 2009), genome scale analysis of cell lines (Agren 
et al, 2012), gene knockout analysis (Holm et al, 2010; Cimini 
et al, 2009), targeted pathway analysis (Vongsangnak et al, 
2009), there are some shortcomings that may affect the accuracy 
of the algorithm. First, the z-score method uses student's Mest to 
measure the amount of change between two variables. However, 
the resulting P- value is highly dependent on the degrees of free- 
dom. It has been shown that, owing to the small number of 
samples in cohorts, P-values may not work as intended (Cicek 
et al, 2013). Second, RA uses a univariate approach. That is, it 
determines the changes per gene, and does not take dependencies 
among genes into account. For a reaction associated with mul- 
tiple genes, the gene with the highest z-score is used, and the rest 
are discarded (Zelezniak et al, 2010). 

In Figure 1A, we illustrate the problem via an example. The 
figure shows an example from the application of the RA algo- 
rithm to compare AarcAAfnr and wild-type (WT) strains of 
Escherichia coli (aerobic) in the gene expression dataset (Covert 
et al, 2004) (please see Section 2 for details). Genes (pentagons) 
are assigned z-scores first, and then the maximum z-score is se- 
lected and assigned as the z-score of the reaction (rectangles). 

For the reaction copragon transport via ton system, the max- 
imum z-score belongs to the gene fhuE(0.$$). This scoring ignores 
the contribution of the gene tonB, as any z-score value for tonB in 
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Fig. 1. Application of RA in comparison of AarcAAfnr and WT strains 
of E.coli (aerobic). Rectangles represent reactions, pentagons represent 
genes and the circle represents the metabolite, copragon. (A) Reactions 
transfer copragon from extracellular space to periplasm, and then to 
cytoplasm. The maximum change (z-score) for the genes of copragon 
transport via ABC system is —0.1, and the genes of copragon transport 
via ton system is 0.88. Aggregate z-score for the metabolite coprogen is 
assigned as 0.54. (B) When genes of copragon transport via ABC system 
are considered together they return MI of 0.792, and MI for the genes of 
copragon transport via ton system is 0.808. Average turns out to be 0.8, 
which is a relatively high mutual information value. Result is different 
than the prediction made by RA 

the range [—Infinity, 0.88) would yield the same z-score for the 
reaction. Finally, the method is additive in aggregating z-scores of 
each neighboring reaction to determine the z-score of a metabolite 
[as shown in Equation (2)]. However, z-score ranges from nega- 
tive to positive infinity, negative infinity representing the most 
insignificant case. Therefore, averaging individual results intro- 
duces the problem of opposite signs cancelling each other out. In 
Figure 1A, copragon is assigned a z-score: 0-1 + 0.88) = 0.54. 
Negative z-score (—0.1) assigned to the reaction copragon trans- 
port via ABC system partially cancels out the z-score of the reac- 
tion copragon transport via ton system (0.88). The problem would 
have been resolved if the scoring mechanism used, assigned zero 
to the most insignificant case (P = 1). 

In this article, we present a new algorithm, called Mutual 
Information-based Reporter Algorithm (MIRA) that addresses 
the shortcomings of the original RA. MIRA is a multivariate 
and combinatorial algorithm that calculates the aggregate tran- 
scriptional response around a metabolite using mutual 
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information. Mutual Information is an information theoretic 
method that measures how much knowing one variable reduces 
the uncertainty about the other. In gene expression analysis, it has 
been used frequently (Butte et aL, 2000; Chowdhury et aL, 2010; 
Gupta et aL, 2010; Steuer et aL, 2002; Zhang et aL, 2010). 
Chowdhury et aL show that combinatorially dysregulated subnet- 
works found through mutual information is predictive for cancer. 
In metabolomics analysis, using the combinations of metabolites 
and their levels, ADEMA (the algorithm for determining expected 
metabolite alterations) predicts expected changes of metabolite 
levels in a given metabolic network using mutual information 
(Cicek et aL, 2013). In the context of MIRA, the two variables 
are (i) the genotype (e.g., case and control) and (ii) the combin- 
ations of the genes associated with a reaction and their expression 
levels. More specifically, MIRA performs the following tasks: (i) 
discretize gene expression levels using B-spline functions, (ii) cal- 
culate the mutual information between the genotype, as well as 
combinations of genes associated in a reaction and their discre- 
tized expression levels, and (iii) calculate the average mutual in- 
formation for each metabolite using the mutual information of 
each neighboring reaction. Figure 2 shows the overall flow of the 
algorithm. 

The advantages of MIRA over RA are as follows: 

(1) Combinatorial mutual information works well when the 
sample sizes are small, and performs better than univariate 
significance testing. 



(2) Unlike the RA, MIRA uses a multivariate method, ana- 
lyzing multiple genes at a time. Therefore, it does not dis- 
card less insignificant changes unlike RA. MIRA uses 
measurements of individual samples instead of comparing 
sample means and is able to capture linear and non-linear 
dependencies among variables. 

(3) Mutual information is bounded by zero and the minimum 
of the entropies of the two random variables. The most 
insignificant case is assigned the score zero; therefore, in- 
significant changes do not cancel out significant changes. 

(4) MIRA has no bias toward highly connected metabolites, 
as it normalizes the sum of changes around a metabolite 
using the number of reactions instead of the square root 
of number of reactions as RA does [see Equations (2) 
and (8)]. 

Figure IB shows the results obtained by MIRA for the intro- 
ductory example shown in Figure 1A. Unlike the RA, which 
assigns a low score to the reaction copragon transport via ABC 
system indicating that there is no significant change on this 
enzyme, MIRA predicts relatively high mutual information indi- 
cating that when considered together genes fhuB, fhuC and fhuD 
are expressed differently. MIRA predicts similar results for both 
reactions and the average is found as 0.8 (max 0.98 in this test), 
whereas the RA assigns 0.54 to copragon (max ~13 in this test). 
The difference between the two algorithms stems from the fact 
that (i) MIRA performs a multivariate analysis compared with 
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Fig. 2. Algorithm Flow. Rectangles represent reactions, pentagons represent genes and circles represent metabolites (darker red represents higher 
average mutual information). Algorithm starts by generating gene-reaction and reaction-metabolite associations out of the SBML file of the recon- 
structed metabolic network. Next, for each metabolite, gene sets are constructed based on their association with the neighboring reactions. As the third 
step, gene expression levels are discretized using B-splines. Fourth step consists of calculating the mutual information between the class variable and the 
discretized expression levels of groups of genes. After each metabolite is assigned average mutual information, based on the calculations done in Step 5, 
metabolites are ranked based on their average mutual information, and reporter metabolites are determined (darker red means it is a reporter metabolite) 
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the univariate analysis done by RA, (ii) given that there are 
only seven samples (3 AarcAAfnr and 4 WT bacteria), mutual 
information is able to capture the change better than z-scores 
and (iii) being non-negative, mutual information does not cancel 
out significant effects. 

To evaluate MIRA, we have analyzed six strains of E.coli with 
knockouts of transcriptional regulators in the oxygen response 
(AarcA, AappY, Afnr, AoxyR, AsoxS and AarcAAfnr) in aerobic 
and anaerobic conditions (Covert et al, 2004). We have used the 
reconstructed metabolic network of E.coli (iAF1260; Feist et al, 
2007). We have also analyzed the autism gene expression dataset 
(Voineagu et a/., 201 1) using the Recon 1 genome scale metabolic 
network for humans (Duarte et al, 2007). For the E.coli dataset, 
we focused on the Afnr knockout, which affects the switch be- 
tween aerobic and anaerobic respiration. MIRA was able to suc- 
cessfully capture metabolites that were closely related to this 
enzyme and anaerobic respiration mechanism. For the autism 
dataset, MIRA predicted metabolites that have been recently 
discovered in the autism literature. We have also shown that 
MIRA has no bias toward hub metabolites unlike RA, and scor- 
ing scheme for MIRA is empirically significant. 



B-spline functions are defined by parameters M and k where M is the 
number of bins (chunks) that the domain is going to be divided into, and k, 
1 < k < M is the number of bins an observation can be assigned to (e.g. 
k = 1 is equivalent to histogram-based binning). Based on M and k, 
each B-spline curve i, 1 < i < M, is assigned a basis vector, the so-called 
knot vector t h defined as in Equation (3), and then the curve i is defined as 
a function of the neighboring curves, as shown in Equations (4) and (5). 
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Each measurement (e.g. expression level for gene gn for person x) is as- 
signed to a bin i with probability B t ^(§(gn)), where §(gn) is a function that 
normalizes the value of the measurement for gn using the maximum and 
the minimum values observed for gn in the dataset. 



2 MATERIALS AND METHODS 

This section describes subcomponents and techniques MIRA uses to 
detect the hot spots in the metabolic network. Supplementary Table SI 
describes the abbreviations and notations used throughout the section. 
Please see Supplementary Appendix A, Table SI, for a list of terms and 
variables and their explanations. 

2.1 Constructing the network 

In the context of our algorithm, the network is a hyper-graph G(V,E) 
where the vertex set V is the union of three entity types: metabolites, genes 
and reactions. The edge set E contains two types of edges: (i) edges that 
connect reactions to the associated genes whose expression lead to the 
corresponding enzyme and (ii) edges that connect metabolites to asso- 
ciated reactions based on producer/consumer relationships. Network in- 
formation is obtained through the genome-scale reconstructed metabolic 
network of the organism to be analyzed using an SBML parser. 

2.2 Discretizing gene expression data 

To calculate mutual information between the genotype and the gene ex- 
pression observations, one needs to calculate the probability of observing 
that profile. This is a well-studied problem in the literature (Silverman, 
1986). There are three techniques that do not assume that the values come 
from a known distribution (which is the case for gene expression data). 
Kernel Density Estimation aims to measure the density of observations 
falling into a predetermined window using a kernel (Moon et al, 1995), 
though it suffers from the high computational cost, and the results are 
dependent on the kernel length. Second technique is the histogram-based 
classification, which determines thresholds by dividing the domain of the 
variable into equal-sized chunks and classifying observations based on 
these thresholds. Despite the low computational cost, observations that 
are close to the thresholds are likely to be misclassified, as analytical 
methods associate an error term with each observation (Cakmak et al., 
2012; Cicek and Ozsoyoglu, 2012). To fix this shortcoming, B-spline 
functions have been used to associate probabilities with each bin deter- 
mined by the histogram-based approach (Cicek et al, 2013; Faith et al., 
2007). That is, each observation is associated with a probability to be in a 
bin, instead of making a binary decision to determine if it is in a given bin 
or not. 



2.3 Calculation of mutual information 

After expression values are calculated, mutual information is calculated. 
Given two random variables X and Y, mutual information I( X; Y) meas- 
ures how much knowing one reduces the uncertainty about the other. 

In the context of MIRA, the first variable is the binary class variable C 
[e.g. control versus variable) and the second variable B(G(r)) is the 
binned measurements of a group of genes G(r) that are associated with 
a reaction r. For instance, for the example shown in Figure 2B, genes 
fhuE and tonB are grouped based on their association with the reaction 
coprogen transport via ton system ( ctts). Assuming we use 2 bins (up and 
down) then, C = {WT, AarcAAfnr] and B(G(ctts) ) = {fhuE up & tonB 
up,fhuE up & tonB down,fhuE down & tonB up,fhuE down & tonB down). 
I(C;B(G(ctts) ) ) is found to be 0.808. 

The goal of calculating I( C;B( G(r))) is to learn how much knowing 
discretized gene expression levels of related genes reduces the uncertainty 
on the genotype. In other words, we find out whether a reaction r and its 
corresponding genes are predictive on the genotype. If the expression 
levels of these genes (when considered together) are different with respect 
to the class variable, then we obtain a high mutual information value. 
Equation (6) specifies the mutual information formula. 

/(C; *<*,))) - I^L^ c) * l g (|H) (6) 

p(bg) is calculated as shown in Equation (7) and k is a constant input to 
the algorithm. That is, given a dataset D, probability of observing g (e.g. 
fliuE up and tonB up) is the multiplication of spline values for each gene 
(e.g.fhuE and tonB) to be in the corresponding bins (e.g. up and up in this 
case), summed and averaged over all individuals in D. This calculation 
assumes that gene expression values are independent of each other. 
p(bg,c) is calculated similarly and p(c) is constant in the dataset. 

P(bg) = " — (7) 

We define the aggregate transcriptional regulation around a metabolite 
m as the average mutual information of the consumer and producer reac- 
tions of m. Given that R(m) is the set of neighboring reactions, 
then average mutual information I m for metabolite m is defined as in 
Equation (8). 
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MIRA fits a beta distribution to I m values given the interval. Then, 
I m ~ Beta(a, /3) where a and are learned from the sample. Finally, all 
metabolites with P<0.05 are picked as reporter metabolites. 



2.4 Datasets and experimental design 

To test the performance of MIRA and compare it with RA, we con- 
sidered two resources. First, we used mRNA expression profiles of six 
Exoli strains with knockouts of transcriptional regulators of the oxygen 
response (AarcA, AappY, Afnr, AoxyR, AsoxS and AarcAAfnr) pub- 
lished and released in Covert et al. (2004). For each strain they obtained 
measurements in aerobic and anaerobic conditions. Consequently, we 
used 12 datasets for the knockouts and 12 for the WT. We compared 
the expression profiles of selected knockouts against the control, to 
obtain reporter metabolites using each respective method. We did not 
perform cross-condition comparisons. For instance, we compared knock- 
out measurements under aerobic conditions with WT measurements 
under aerobic conditions only. Second, we ran tests on the autism spec- 
trum disorder (ASD) brain gene expression dataset (Voineagu et al, 
2011). The dataset contains gene expression levels for 8858 genes for 58 
human cortex samples (29 ASD and 29 controls). 

We implemented MIRA in C# language using ET Framework 4.0. 
Tests were run on a Dell PowerEdge R710 Server with two Intel® 
Xeon® quad processors and 48 GB main memory. The server runs on 
Windows Server 2008 operating system. 

For E.coli knockout tests, we used the genome scale-reconstructed 
metabolic network model of E.coli, iAF1260 (Feist et al, 2007). The 
model contains 1972 metabolites, 2382 reactions, 1261 genes and 3 com- 
partments. We mapped the measured genes to the corresponding reac- 
tions in the metabolic network, as annotated in the model. After the 
mapping, we obtained 1643 metabolites, to which there was at least 
one reaction associated with a gene measured in the dataset. Only the 
obtained 1643 metabolites were considered in the tests. For the Autism 
dataset, we used the Reconl genome-scale metabolic network for humans 
(Duarte et al., 2007). The model consists of 3188 metabolites, 3742 reac- 
tions, 1499 genes and 8 compartments. Mapping the genes measured to 
the network resulted in a metabolite set of size 2331 for further consid- 
eration. For both datasets, we used M = 6 and k = 4 to discretize meas- 
urements using B-splines. Please see Supplementary Appendix B for time 
requirements. 



3 RESULTS 

3.1 Comparing reporter metabolite sets of MIRA and RA 

To compare the performances of MIRA and RA, we obtained 
reporter metabolites using both algorithms. First, we investigated 
how similar two sets are using the Jaccard distance (JD). JD is 
complementary to the Jaccard index, which is the ratio of shared 
items between the two sets, A and B, and the union of the items 
in two sets. Jaccard index is denoted as J(A,B). JD, J S (A,B), is 
equivalent to 1 —J(A,B). Two algorithms yield different sets of 
reporter metabolites for E.coli knockouts. For the 
knockouts AappY (anaerobic) and AOxyR (aerobic), the sets 
of reporter metabolites are totally distinct (JD =1) and the 
smallest JD obtained in these tests is 0.85. For the Autism data- 
set, JD is 0.9. 



3.2 Robustness against hub metabolites 

Metabolic networks are known to be scale free (Albert, 2005), 
that is, their degree distribution follows the power law. Our re- 
sults show that the reporter metabolites found by RA are usually 
highly connected metabolites known as hubs or common metab- 
olites, which are rare in scale-free networks. For instance, in the 
test for Afnr (aerobic), RA marks H + , H 2 0, ATP, ADP, 
Phosphate, Diphosphate and C0 2 as the top 7 reporter metab- 
olites, which participate in many reactions and are associated 
with many genes in the metabolic network. To test if RA has a 
bias toward hub metabolites, we calculated the average gene 
connectivity and reaction connectivity of the reporter metabolites 
for each algorithm. We also formed a random set of metabolites 
where the size of the random set is randomly chosen as a number 
between the sizes of reporter sets produced by MIRA and RA. 
We repeated the random set selection 10 000 times and found the 
average connectivity for the random set. Figure 3A shows the 
average gene connectivity, and Figure 3B shows the average re- 
action connectivity of the reporter metabolites for MIRA, RA 
and the random set for E.coli knockout tests. Results show that 
RA favors highly connected metabolites, whereas reporter me- 
tabolites found by MIRA have a closer degree distribution to the 
random set, and does not have such a bias. This also applies to 
reporter metabolites found in the Autism dataset by RA 
(Supplementary Table S5). The intuitive reason for this differ- 
ence is that MIRA averages the mutual information found for 
each reaction, whereas RA divides the sum by the square root of 
the number of reactions around a metabolite. 

3.3 Interpreting reporter metabolites for Afnr 
(anaerobic) dataset 

When E.coli has no 0 2 as the final electron acceptor, it switches 
to anaerobic respiration and uses electron donating dehydrogen- 
ases and accepting reductases on the membrane. Fumarate 
Nitrate Reductase (fnr) is a transcriptional regulator that regu- 
lates 100+ genes and nitrate/fumarate reduction in response to 
the switch from aerobic to anaerobic respiration in E.coli. Fnr 
has an iron-sulfur cluster [4Fe-4S] that senses the presence of 
oxygen and becomes inactivated when oxidized in the presence 
of oxygen. It can also be converted into a disulfide form by 
glutathione or thioredoxin when inactive (Daruwala and 
Meganathan, 1991). 

Figure 4 shows a simplified depiction of the dynamics in 
anaerobic respiration with respect to fnr (Keseler et al, 2013; 
MetaCyc; Unden and Bongaerts, 1997; UniProt). Although 
there are many types of dehydrogenases and reductases, we 
draw them in two groups for the sake of simplicity: (i) hydrophilic 
side toward periplasm and (ii) hydrophilic side toward cytoplasm. 
Electron donors like G3P, formate, lactate, NADH, H 2 are oxi- 
dized by the hydrogenases, and electrons are transported using 
menaquinone to the reductases. Focusing on nitrate reductase, 
this electron is transferred to protoheme, then to Fe-S cluster 
and, finally, to molybdenum (Mo). It is used to reduce nitrate 
to nitrite. Fnr stimulates the expression of this enzyme. In the 
presence of 0 2 , fnr is oxidized and inactivated. As stated above, 
glutathione and thioredoxin act as electron donors to activate fnr. 

In Supplementary Appendix C, Table S2 lists the reporter me- 
tabolites found by MIRA and Table S3 lists reporter metabolites 
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Fig. 3. Application of Average gene and reaction connectivity of the reporter metabolites. Panel (A) shows average number of genes associated with the 
reporter metabolites found by RA and MIRA in E.coli knockout tests. Random set reports the average number of genes connected randomly chosen 
metabolites of the same size as the original sets (repeated 10000 times and averaged). Panel (B) shows the same results for the average number of 
reactions associated with metabolites 



found by RA based on Afnr knockout under anaerobic condi- 
tion. Previous description is based on the data obtained from 
UniProt Database, which summarizes the key concepts and me- 
tabolites related to fnr. The metabolites reported by MIRA 
highly overlap with this definition. The second reporter metab- 
olite protoheme (heme b) is an important metabolite in the heme- 
biosynthesis pathway, and is the first electron acceptor in nitrate 
reductase complex (Metacyc). As mentioned above, glutathione 
and thioredoxin are key agents to convert the enzyme, and 
MIRA detects them as reporter metabolites [glutaredoxin 
(reduced/oxidized) and thioredoxin (reduced/oxidized)]. Nitrite 
is one of the direct products of nitrate reductase, and it is also 
reported as a reporter metabolite. 

Aside from the metabolites in UniProt's definition, MIRA 
found dimethyl sulfide/sulfoxide (DMSO) and trimethylamine/ 
trimethylamine n-oxide (TMAO) as reporter metabolites. As 
Figure 4 shows, E.coli uses N- and S- oxides as the terminal 
electron acceptors (Daruwala and Meganathan, 1991). 

When RA is considered, the top metabolite picked by MIRA 
is not a reporter metabolite in RA's list. Tungstate is known as a 
direct inhibitor of nitrate reductase as well as TMAO and 
DMSO reductases, which are also reporter metabolites (Prins 
et aL, 1980). Similarly, tripeptide murein units [short name for 
two linked disacharide tripeptide murein units (uncrosslinked 
middle of chain)} is also not picked by RA and might seem un- 
related at first. Murein units constitute bacterial cell walls. 
Membrane-bound lytic murein transglycosylase is the enzyme 
that degrades mureins, and the gene responsible to transcribe 
this enzyme is dniR (mltD). A mutant of this enzyme is known 
to be defective in producing nitrite reductase. Nitrate and nitrite 
also stimulate the expression of dniR (Kajie et aL, 1991). 

Most striking difference in the reporter metabolites found by 
RA is that the first seven metabolites are H + , H20, ADP, P, 
ATP, NAD, NADH (NADP, NADPH, 0 2 , CDP, UDP, dADP, 
dCDP, GDP, GTP, CoA are also listed). These are highly 



connected metabolites, which take place in many biochemical 
activities in the cell, and therefore, it is hard to link them to 
the effect of the knockout in this test. MIRA puts these metab- 
olites within (695 964] in the ranking of 1643 metabolites con- 
sidered. Also RA lists many metabolites from the central 
metabolism: Glycerophosphoglycerol, pyruvate, glycerol, F6P, 
succinate d-glucose and acetyl-CoA. Similarly, these metabolites 
can be considered as general owing to the diverse functionality of 
the pathway they are in. Having that said, RA detects some 
metabolites that are not picked by MIRA and are relevant. 
For instance, sulfate and molybdate are incorporated into fnr 
and nitrate reductase (Tavares et aL, 2006). Menaquinone 8, 
menaquinol 8 and 2-demethylmenaquinol 8, link dehydrogenases 
and reductases/oxidases in the electron transport chains, and 
hence are directly related to fnr enzyme as shown in Figure 4 
(Unden and Bongaerts, 1997). Although ubiquinone-8/ubiqui- 
nol-8 play an important role in aerobic respiration, they are 
listed higher than menaquinone 8/menaquinol 8. 

In conclusion, the results suggest that (i) MIRA has no bias 
toward hub metabolites, and successfully downplays their im- 
portance, and (ii) MIRA yields reporter metabolites, which are 
in close proximity to the enzyme and are relevant with respect to 
the literature. 

3.4 Interpreting reporter metabolites for the 
autism dataset 

ASD is a developmental genetic disorder that causes social inter- 
action abnormalities, communication deficiencies and repetitive 
behavior. Although the disease has a genetic origin, metabolic 
implications have been studied widely in the literature (Boccuto 
et aL, 2013; Emond et aL, 2013; Yap et aL, 2010). Running 
MIRA on the gene expression dataset provided by Voineagu 
et aL, 2011, has resulted in 52 reporter metabolites as shown in 
Supplementary Appendix C, Table S4. Reporter metabolites 
found by RA is listed in Supplementary Appendix C, Table S5. 
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Fig. 4. Anaerobic respiration of E.coli with respect to fnr and reporter metabolites found by MIRA. Anaerobic respiration of E.coli couples electron 
donors to electron acceptors via dehydrogenases and reductases on the inner membrane. There are many types of dehydrogenases and reductases; 
however, only two types are shown: hydrophilic side toward cytosol and toward periplasm. Sizes and shapes of the proteins are not drawn to scale. Only 
nitrate reductase is shown in more detail and separately. Formate, lactate, NADH, H 2 G3P are electron donors and lead to reduction of acceptors like 
nitrate, nitrite, DMSO, TMSO and fumarate. Menanquinone acts as a mediator between dehydrogenases and reductases. In the case of nitrate reductase, 
electron is transferred through protoheme, Fe-S cluster and Molybdenum to reduce nitrate to nitrate. Fnr activates this enzyme, and tungstate is a well- 
known inhibitor. Fnr is inactivated by oxygen and can be reactivated by agents like glutaredoxin and thioredoxin. Murein units constitute the cell wall, 
and the enzyme that degrades murein units is transcribed by dniR gene. It is known that dniR regulates nitrite reductase and it is stimulated but nitrite 
and nitrite 



The first and the eighth metabolites located as reporter by 
MIRA are protein- linked serine/threonine residue and protein- 
linked asparagine residue at glycosylation sites. Glycosylation 
is a process that attaches glycans to proteins as a post-trans- 
lational modification in the secretion pathway. Secretion path- 
way is known as an important factor in brain development, 
and DIA1R mutation leads to ASD and mental retardation 
(Aziz et aL, 2011). More specifically, it is reported in the 
literature that 7 glycosylation-related genes are affected by 
copy number variations (CNVs) in autism patients (van der 
Zwaag et aL, 2009; Pinto et aL, 2010). We also observe 
glycans such as glycophosphatidylinositol later in the reporter 
list. 

Glucuronidation is an important process for detoxification of 
most xenobiotics by making such components more water- 
soluble and less toxic by attaching glucuronate to the substrates 
(Stein et aL, 2011). The second metabolite d-glucorate and the 
third metabolite d-glucurono-6,3-lactone (d-glucurone), located 



by MIRA, are direct precursors of glucuronate. Stein et aL 
(2011) reports evidence on lower glucuronidation levels in chil- 
dren ASD, which may explain d-glucurono-6,3-lactone being a 
stress point in the metabolic network. L-Arabinose is also in this 
pathway and is located as a reporter by MIRA. 

Histone n6-methyl-l-lysine, protein n6,n6-dimethyl-l-lysine, 
peptdyl-l-lysine and protein n6,n6,n6-trimethyl-l-lysine, all 
located by MIRA, belong to lysine degradation pathway. 
These metabolites are centered on the path that consumes 
lysine and produces carnitine. Celestino-Soper et aL have 
revealed that dysregulation of carnitine metabolism may be im- 
portant in non-dysmorphic autism. The results show that a de- 
letion in TMLHE gene on this pathway has a significant 
correlation with ASD (Celestino-Soper et aL, 2012). In a recent 
work, Frye et aL (2013) links the abnormalities in acyl-carnitine 
levels and autism. 

Lipoylprotein, lipoamide, dihydrolipolprotein and dihydroli- 
poamide, all located by MIRA, are four metabolites that are in 
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Fig. 5. Empirical testing of the significance of the scoring schemes. Panel (A) shows series of z-scores obtained for each random set, and for the original 
dataset (shown as the red line with large circles) by RA. Scores are sorted in descending order. Panel (B) shows the I m s for the same data obtained by 
MIRA. Panels (C) and (D) show close-ups for the first 50 metabolites for Panels (A) and (B), respectively 



the glycine cleavage pathway. MIRA reports four metabolites 
out of six in this pathway and points to an alteration in 
this process. A recent work by Yu et al. (2013) confirms that a 
mutation in AMT gene is also associated with ASD. This muta- 
tion leads to a deficiency in glycine cleavage system. In addition 
to this, the metabolic profiling done by Yap et al. (2010) shows 
significant differences in glycine levels in the urine of autistic 
children. 

Starting with stearidonyl coenzyme A, MIRA detects 17 inter- 
mediates of fatty acid synthesis pathway as reporter metabolites. 
Fatty acid metabolism and autism have been associated in the 
literature. Richardson and Ross point to the growing evidence 
on the relation between neurodegenerative diseases and fatty acid 
abnormalities (Richardson and Ross, 2000). Among more recent 
works, Tamiji and Crawford state that children with autism 
show higher rates of lipid metabolism than controls (Tamiji 
and Crawford, 2010). El-Ansary et al. (2011) report increase in 
most of the saturated fatty acids in a cohort of 52 autism 
patients. 

In comparison, none of the aforementioned reporter metabol- 
ites are located by RA, but common metabolites are highly 
ranked as in E.coli tests. 

In summary, reporter metabolites picked by MIRA for autism 
are backed by literature. Results show that, although some pre- 
dictions (e.g. glycine cleavage deficiency) are not obvious targets, 
the MIRA method was able to predict them. The literature on 
the relation between autism and (i) glucuronation (Stein et al, 
2011), (ii) lysine degradation and carnitine metabolism 
(Celestino-Soper et al, 2012; Frye et al, 2013) and (iii) glycine 
cleavage (Yu et al, 2013) did not exist at the time of the gene 
expression dataset was published. Hence, MIRA shows promis- 
ing prospect for discovering new metabolic targets. 

3.5 Empirically testing the significance of scoring 
schemes used by MIRA and RA 

To assess the significance of the scores calculated by both algo- 
rithms and assess the reliability of the rankings, we used an 



empirical significance testing using the following method. We 
(i) shuffled the labels of the individuals in the autism dataset 
100 times to obtain 100 random datasets, (ii) ran MIRA and 
RA on these datasets, as well as on the original data, and (iii) 
sorted and plotted the scores in descending order for all 101 
instances. Figure 5 A shows the scores for RA, and Figure 5B 
shows the results for MIRA. Big red circles represent the results 
found on the original dataset. Figures 5C and 5D show a close- 
up for the first 50 metabolites in the ranking. MIRA's results for 
the original dataset dominate the random curves and, hence, 
suggest an empirically significant result. On the other hand, 
RA's output for original data follows a similar pattern with 
the random datasets. 



4 CONCLUSION 

Metabolic networks have received significant attention in the 
past decade. This advancement has led to the investigation of 
various genetic diseases and their metabolism with the use of the 
reconstructed genome scale metabolic networks. One application 
is to find the regulatory architecture of the metabolic network 
using the underlying transcriptome. The RA finds the metabolic 
hot spots around which transcriptional regulation is centered. In 
this article, we developed a novel method, called MIRA, that 
uses a combinatorial approach and mutual information to find 
reporter metabolites. Our approach addresses the shortcomings 
of the existing RA algorithm. More specifically, it is robust 
against small sample sizes, uses a multivariate approach instead 
of a univariate one and does not cancel out significant changes in 
expression levels with non- significant ones. Our results show that 
(i) MIRA has no bias on picking hub metabolites as reporter 
metabolites, (ii) reporter metabolites found by MIRA are bio- 
logically sound and are supported by literature even for a com- 
plex disease like Autism, (iii) MIRA captures the effects of a 
knockout of the Fnr gene in E.coli successfully and (iv) MIRA 
provides empirically significant results, which supports the fact 
that it captures the underlying biological phenomenon. 
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